Abstract

Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress at theoretically evaluating the difference in performance between sampling with- and without-replacement in such algorithms. Focusing on least mean squares optimization, we formulate a noncommutative arithmetic-geometric mean inequality that would prove that the expected convergence rate of without-replacement sampling is faster than that of with-replacement sampling. We demonstrate that this inequality holds for many classes of random matrices and for some pathological examples as well. We provide a deterministic worst-case bound on the discrepancy between the two sampling models, and explore some of the impediments to proving this inequality in full generality. We detail the consequences of this inequality for stochastic gradient descent and the randomized Kaczmarz algorithm for solving linear systems.

Keywords. Positive definite matrices. Matrix Inequalities. Randomized algorithms. Random 
matrices. Optimization. Stochastic gradient descent. 



1 Introduction 

Randomized sequential algorithms abound in machine learning and optimization. The most famous is perhaps stochastic gradient descent (see Shalev-Shwartz and Srebro, 2008), but other popular methods include algorithms for alternating projections (see Strohmer and Vershynin, 2009; Leventhal and Lewis, 2010), proximal point methods (see Bertsekas, 2011), coordinate descent (see Nesterov, 2010) and derivative free optimization (see Nesterov, 2011; Nemirovski and Yudin, 1983). In all of these cases, an iterative procedure is derived where, at each iteration, an independent sample from some distribution determines the action at the next stage. This sample is selected with-replacement from a pool of possible options.

In implementations of many of these methods, however, practitioners often choose to break the independence assumption. For instance, in stochastic gradient descent, many implementations pass through each item exactly once in a random order (i.e., according to a random permutation). In randomized coordinate descent, one can cycle over the coordinates in a random order. These strategies, employing without-replacement sampling, are often easier to implement efficiently, guarantee that every item in the data set is touched at least once, and often have better empirical performance than their with-replacement counterparts (see Bottou, 2009; Feng et al.).



Unfortunately, the analysis of without-replacement sampling schemes is quite difficult. The independence assumption underlying with-replacement sampling provides an elegant Markovian framework for analyzing incremental algorithms. The iterates in without-replacement sampling are correlated, and studying them requires sophisticated probabilistic tools. Consequently, most analyses of without-replacement optimization assume that the iterations are assigned deterministically. Such deterministic orders might incur exponentially worse convergence rates than randomized methods (Nedic and Bertsekas, 2000), and deterministic orders still require careful estimation of accumulated errors (see Luo, 1991; Tseng, 1998). The goal of this paper is to make progress towards patching the discrepancy between theory and practice of without-replacement sampling in randomized algorithms.

In particular, in many cases, we demonstrate that without-replacement sampling outperforms with-replacement sampling provided a noncommutative version of the arithmetic-geometric mean inequality holds. Namely, if $A_1, \dots, A_n$ are a collection of $d \times d$ positive semidefinite matrices, we define the arithmetic and (symmetrized) geometric means to be

$$M_A := \frac{1}{n}\sum_{i=1}^n A_i \quad\text{and}\quad M_G := \frac{1}{n!}\sum_{\sigma \in S_n} A_{\sigma(1)} \times \cdots \times A_{\sigma(n)},$$

where $S_n$ denotes the group of permutations. Our conjecture is that the norm of $M_G$ is never larger than the norm of $(M_A)^n$. Assuming this inequality, we show that without-replacement sampling leads to faster convergence for both the least mean squares and randomized Kaczmarz algorithms of Strohmer and Vershynin (2009).



Using established work in matrix analysis, we show that these noncommutative arithmetic-geometric mean inequalities hold when there are only two matrices in the pool. We also prove that the inequality is true when all of the matrices commute. We demonstrate that if we don't symmetrize, there are deterministically ordered products of $n$ matrices whose norm exceeds $\|M_A\|^n$ by an exponential factor. That is, symmetrization is necessary for the noncommutative arithmetic-geometric mean inequality to hold.

While we are unable to prove the noncommutative arithmetic-geometric mean inequality in full generality, we verify that it holds for many classes of random matrices. Random matrices are, in some sense, the most interesting case for machine learning applications. This is particularly evident in applications such as empirical risk minimization and online learning where the data are conventionally assumed to be generated by some i.i.d. random process. In Section 4 we show that if $A_1, \dots, A_n$ are generated i.i.d. from certain distributions, then the noncommutative arithmetic-geometric mean inequality holds in expectation with respect to the $A_i$. Section 4.1 assumes that $A_i = Z_i Z_i^T$ where the $Z_i$ have independent entries, identically sampled from some symmetric distribution. In Section 4.2, we analyze the random matrices that commonly arise in stochastic gradient descent and related algorithms, again proving that without-replacement sampling exhibits faster convergence than with-replacement sampling. We close with a discussion of other open conjectures that could impact machine learning theory, algorithms, and software.



2 Sampling in incremental gradient descent 

To illustrate how with- and without-replacement sampling methods differ in randomized optimization algorithms, we focus on one core algorithm, the Incremental Gradient Method (IGM). Recall that the IGM minimizes the function

$$\operatorname*{minimize}_x \; f(x) = \sum_{i=1}^n f_i(x) \tag{2.1}$$

via the iteration

$$x_k = x_{k-1} - \gamma_k \nabla f_{i_k}(x_{k-1}). \tag{2.2}$$

Here, $x_0$ is an initial starting vector, $\gamma_k$ are a sequence of nonnegative step sizes, and the indices $i_k$ are chosen using some (possibly deterministic) sampling scheme. When $f$ is strongly convex, the IGM iteration converges to a near-optimal solution of (2.1) for any $x_0$ under a variety of step-size protocols and sampling schemes including constant and diminishing step-sizes (see Anstreicher and Wolsey, 2000; Bertsekas, 2012; Nemirovski et al., 2009). When the increments are selected uniformly at random at each iteration, IGM is equivalent to stochastic gradient descent. We use the term IGM here to emphasize that we are studying many possible orderings of the increments. In the next examples, we study the specialized case where the $f_i$ are quadratic and the IGM is equivalent to the least mean squares algorithm of Widrow and Hoff (1960).



2.1 One-dimensional Examples 

First consider the following toy one-dimensional least-squares problem

$$\operatorname*{minimize}_x \; \frac{1}{2}\sum_{i=1}^n (x - y_i)^2 \tag{2.3}$$

where $y_i$ is a sequence of scalars with mean $\mu_y$ and variance $\sigma^2$. Applying (2.2) to (2.3) results in the iteration

$$x_k = x_{k-1} - \gamma_k (x_{k-1} - y_{i_k}).$$

If we initialize the method with $x_0 = 0$ and take $n$ steps of incremental gradient with stepsize $\gamma_k = 1/k$, we have

$$x_n = \frac{1}{n}\sum_{j=1}^n y_{i_j},$$

where $i_j$ is the index drawn at iteration $j$. If the steps are chosen using a without-replacement sampling scheme, $x_n = \mu_y$, the global minimum. On the other hand, using with-replacement sampling, we will have

$$\mathbb{E}[(x_n - \mu_y)^2] = \frac{\sigma^2}{n},$$

which is a positive mean square error.
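The running-average example can be checked exactly by brute force for a small pool. The following is a minimal sketch under assumed data: the values in `ys` are illustrative and do not come from the paper, and the helper names are hypothetical. It enumerates every permutation and every with-replacement index sequence and uses exact rational arithmetic.

```python
from fractions import Fraction
from itertools import permutations, product

def running_average(ys, order):
    """Run x_k = x_{k-1} - (1/k) (x_{k-1} - y_{i_k}) starting from x_0 = 0."""
    x = Fraction(0)
    for k, i in enumerate(order, start=1):
        x = x - Fraction(1, k) * (x - ys[i])
    return x

def mean_squared_error(ys, orders):
    """Average (x_n - mu_y)^2 over an explicit list of index orders."""
    mu = sum(ys) / len(ys)
    errs = [(running_average(ys, o) - mu) ** 2 for o in orders]
    return sum(errs) / len(errs)

ys = [Fraction(v) for v in (1, 4, 6, 9)]  # illustrative pool (assumption)
n = len(ys)
mu = sum(ys) / n
var = sum((y - mu) ** 2 for y in ys) / n  # population variance sigma^2

# Without replacement: all n! permutations. With replacement: all n^n tuples.
mse_wo = mean_squared_error(ys, list(permutations(range(n))))
mse_wr = mean_squared_error(ys, list(product(range(n), repeat=n)))
print("without replacement:", mse_wo)
print("with replacement:   ", mse_wr, "sigma^2 / n =", var / n)
```

Exact enumeration is used here rather than Monte Carlo sampling so that the comparison with $\sigma^2/n$ is not blurred by simulation noise.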

Another toy example that further illustrates the discrepancy is the least-squares problem

$$\operatorname*{minimize}_x \; \frac{1}{2}\sum_{i=1}^n \beta_i (x - y)^2$$

where $\beta_i$ are positive weights. Here, $y$ is a scalar, and the global minimum is clearly $y$. Let's consider the incremental gradient method with constant stepsize $\gamma_k = \gamma < \min_i \beta_i^{-1}$. Then after $n$ iterations we will have

$$|x_n - y| = |x_0 - y| \prod_{j=1}^n (1 - \gamma \beta_{i_j}).$$

If we perform without-replacement sampling, this error is given by

$$|x_n - y| = |x_0 - y| \prod_{i=1}^n (1 - \gamma \beta_i).$$

On the other hand, using with-replacement sampling yields

$$\mathbb{E}[|x_n - y|] = |x_0 - y| \left(1 - \frac{\gamma}{n}\sum_{i=1}^n \beta_i\right)^n.$$

By the arithmetic-geometric mean inequality, we then have that the without-replacement sample is always closer to the optimal value in expectation. This sort of discrepancy is not simply a feature of these toy examples. We now demonstrate that similar behavior arises in multi-dimensional examples.
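The two error factors in the weighted example are easy to compare numerically. The sketch below assumes illustrative weights and a step size satisfying $\gamma < \min_i \beta_i^{-1}$; the function names are hypothetical and chosen only for this illustration.

```python
import math

def error_factor_without_replacement(betas, gamma):
    """prod_i (1 - gamma * beta_i): the factor after one pass over all items."""
    return math.prod(1.0 - gamma * b for b in betas)

def error_factor_with_replacement(betas, gamma):
    """(1 - gamma * mean(beta))^n: the expected factor after n i.i.d. draws."""
    n = len(betas)
    return (1.0 - gamma * sum(betas) / n) ** n

betas = [0.5, 1.0, 2.0, 4.0]  # illustrative positive weights (assumption)
gamma = 0.2                   # satisfies gamma < 1 / max(betas) = 0.25

wo = error_factor_without_replacement(betas, gamma)
wr = error_factor_with_replacement(betas, gamma)
print("without replacement:", wo)
print("with replacement:   ", wr)
```

Because every factor $1 - \gamma\beta_i$ is positive under the step-size restriction, the scalar arithmetic-geometric mean inequality applies directly to the two printed quantities.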




2.2 IGM in more than one dimension 

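Now consider IGM in higher dimensions. Let $x_\star$ be a vector in $\mathbb{R}^d$ and set

$$y_i = a_i^T x_\star + \omega_i \quad\text{for } i = 1, \dots, n$$

where $a_i \in \mathbb{R}^d$ are some test vectors and $\omega_i$ are i.i.d. Gaussian random variables with mean zero and variance $\rho^2$.

We want to compare with- vs without-replacement sampling for IGM on the cost function

$$\operatorname*{minimize}_x \; \sum_{i=1}^n (a_i^T x - y_i)^2. \tag{2.4}$$

Suppose we walk over $k$ steps of IGM with constant stepsize $\gamma$ and we access the terms $i_1, \dots, i_k$ in that order. Then we have

$$x_k = x_{k-1} - \gamma a_{i_k}(a_{i_k}^T x_{k-1} - y_{i_k}) = (I - \gamma a_{i_k} a_{i_k}^T)\, x_{k-1} + \gamma a_{i_k} y_{i_k}.$$

Subtracting $x_\star$ from both sides of this equation and unrolling the recursion then gives

$$x_k - x_\star = (I - \gamma a_{i_k} a_{i_k}^T)(x_{k-1} - x_\star) + \gamma a_{i_k}\omega_{i_k} = \prod_{j=1}^k (I - \gamma a_{i_j} a_{i_j}^T)(x_0 - x_\star) + \gamma\sum_{\ell=1}^k \prod_{k \ge j > \ell} (I - \gamma a_{i_j} a_{i_j}^T)\, a_{i_\ell}\omega_{i_\ell}. \tag{2.5}$$

Here, the product notation means we multiply by the matrix with smallest index first, then left multiply by the matrix with the next index and so on up to the largest index.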

Our goal is to estimate the risk after $k$ steps, namely $\mathbb{E}[\|x_k - x_\star\|^2]$, and demonstrate that this error is smaller for the without-replacement model. The expectation is with respect to the IGM ordering and the noise sequence $\omega_i$. To simplify things a bit, we take a partial expectation with respect to $\omega_i$:

$$\mathbb{E}[\|x_k - x_\star\|^2] = \mathbb{E}\left[\left\|\prod_{j=1}^k (I - \gamma a_{i_j} a_{i_j}^T)(x_0 - x_\star)\right\|^2\right] + \rho^2\gamma^2 \sum_{\ell=1}^k \mathbb{E}\left[\left\|\prod_{k \ge j > \ell} (I - \gamma a_{i_j} a_{i_j}^T)\, a_{i_\ell}\right\|^2\right]. \tag{2.6}$$

In this case, we need to compare the expected value of matrix products under with- or without-replacement sampling schemes in order to conclude which is better. Is there a simple conjecture, analogous to the arithmetic-geometric mean inequality, that would guarantee without-replacement sampling is always better?
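For a very small problem, the first term of the risk can be evaluated exactly under both sampling models by enumerating all index sequences. The following is a sketch only: the vectors `a`, the target `x_star`, and the step size are illustrative assumptions, the noise is switched off ($\rho = 0$), and the output is one data point rather than evidence for any general inequality.

```python
from itertools import permutations, product

def igm_pass(a, y, x0, gamma, order):
    """Run x_k = x_{k-1} - gamma * a_i (a_i^T x_{k-1} - y_i) along `order`."""
    x = list(x0)
    for i in order:
        resid = sum(ai * xi for ai, xi in zip(a[i], x)) - y[i]
        x = [xi - gamma * resid * ai for xi, ai in zip(x, a[i])]
    return x

def mean_sq_error(a, y, x0, x_star, gamma, orders):
    """Average ||x_k - x_star||^2 over an explicit collection of index orders."""
    total, count = 0.0, 0
    for order in orders:
        x = igm_pass(a, y, x0, gamma, order)
        total += sum((xi - si) ** 2 for xi, si in zip(x, x_star))
        count += 1
    return total / count

# Illustrative noiseless data (assumption); every a_i has unit norm.
a = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
x_star = [1.0, -2.0]
y = [sum(ai * si for ai, si in zip(row, x_star)) for row in a]
x0 = [0.0, 0.0]
gamma = 0.5
n = len(a)

risk_wo = mean_sq_error(a, y, x0, x_star, gamma, permutations(range(n)))
risk_wr = mean_sq_error(a, y, x0, x_star, gamma, product(range(n), repeat=n))
print("without replacement:", risk_wo)
print("with replacement:   ", risk_wr)
```

With unit-norm $a_i$ and $\gamma = 0.5$, each factor $I - \gamma a_i a_i^T$ has eigenvalues in $\{1, 0.5\}$, so neither risk can exceed the initial squared error $\|x_0 - x_\star\|^2$.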

2.3 The Randomized Kaczmarz algorithm 

As another high-dimensional example in the same spirit, we consider the randomized Kaczmarz algorithm of Strohmer and Vershynin (2009). The Kaczmarz algorithm is used to solve the over-determined linear system $\Phi x = y$. Here $\Phi$ is an $n \times d$ matrix with $n > d$ and we assume there exists an exact solution $x_\star$ satisfying $\Phi x_\star = y$. Kaczmarz's method solves this system by alternating projections (Kaczmarz, 1937) and was implemented in the earliest medical scanning devices (Hounsfield, 1973). In computer tomography, this method is called the Algebraic Reconstruction Technique (Herman, 1980; Natterer, 1986) or Projection onto Convex Sets (Sezan and Stark, 1987).

Kaczmarz's algorithm consists of iterations of the form

$$x_{k+1} = x_k + \frac{y_{i_k} - \phi_{i_k}^T x_k}{\|\phi_{i_k}\|^2}\,\phi_{i_k}, \tag{2.7}$$

where the rows $\phi_i^T$ of $\Phi$ are accessed in some deterministic order. This sequence can be interpreted as an incremental variant of Newton's method on the least squares cost function

$$\operatorname*{minimize}_x \; \sum_{i=1}^n (\phi_i^T x - y_i)^2$$

with step size equal to 1 (see Bertsekas, 1999).



Establishing the convergence rate of this method proved difficult in imaging science. On the other hand, Strohmer and Vershynin (2009) proposed a randomized variant of the Kaczmarz method, choosing the next iterate with-replacement with probability proportional to the norm of $\phi_i$. Strohmer and Vershynin established linear convergence rates for their iterative scheme.

Expanding out (2.7) for $k$ iterations, we see that

$$x_k - x_\star = \prod_{j=0}^{k-1}\left(I - \frac{\phi_{i_j}\phi_{i_j}^T}{\|\phi_{i_j}\|^2}\right)(x_0 - x_\star).$$

Let us suppose that we modify Strohmer and Vershynin's procedure to employ without-replacement sampling. After $k$ steps, is the with-replacement or without-replacement model closer to the optimal solution?
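The Kaczmarz projection step (2.7) and the two sampling modes can be sketched in a few lines. This is an illustration under stated assumptions: the system `phi`, `x_star` is made up, the helper names are hypothetical, and the with-replacement draws are uniform for simplicity rather than weighted by row norm as in Strohmer and Vershynin's scheme.

```python
import random

def kaczmarz_step(x, phi_i, y_i):
    """Project x onto the hyperplane {z : phi_i^T z = y_i}, as in (2.7)."""
    resid = y_i - sum(p * xi for p, xi in zip(phi_i, x))
    scale = resid / sum(p * p for p in phi_i)
    return [xi + scale * p for xi, p in zip(x, phi_i)]

def kaczmarz(phi, y, x0, order):
    """Apply the projection step along the given sequence of row indices."""
    x = list(x0)
    for i in order:
        x = kaczmarz_step(x, phi[i], y[i])
    return x

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

rng = random.Random(0)
phi = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]  # illustrative rows
x_star = [1.0, -1.0]
y = [sum(p * s for p, s in zip(row, x_star)) for row in phi]  # consistent system
x0 = [0.0, 0.0]
n = len(phi)

order_wo = rng.sample(range(n), n)               # random permutation of the rows
order_wr = [rng.randrange(n) for _ in range(n)]  # uniform i.i.d. draws (assumption)
err_wo = sq_dist(kaczmarz(phi, y, x0, order_wo), x_star)
err_wr = sq_dist(kaczmarz(phi, y, x0, order_wr), x_star)
print("without replacement:", err_wo)
print("with replacement:   ", err_wr)
```

Since each step is an orthogonal projection onto a hyperplane containing $x_\star$, the error never increases under either ordering; the open question in the text concerns which ordering decreases it faster on average.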



3 Conjectures concerning the norm of geometric and arithmetic 
means of positive definite matrices 

To formulate a sufficient conjecture which would guarantee that without-replacement sampling outperforms with-replacement, let us first formalize some notation. Throughout, $[n]$ denotes the set of integers from 1 to $n$. Let $D$ be some domain, $f : D^k \to \mathbb{R}$, and $(x_1, \dots, x_n)$ a set of $n$ elements from $D$. We define the without-replacement expectation as

$$\mathbb{E}_{\mathrm{wo}}[f(x_{i_1}, \dots, x_{i_k})] = \frac{(n-k)!}{n!}\sum_{(j_1, \dots, j_k)\ \text{distinct}} f(x_{j_1}, \dots, x_{j_k}).$$

That is, we average the value of $f$ over all ordered tuples of $k$ distinct elements from $(x_1, \dots, x_n)$. Similarly, the with-replacement expectation is defined as

$$\mathbb{E}_{\mathrm{wr}}[f(x_{i_1}, \dots, x_{i_k})] = n^{-k}\sum_{j_1, \dots, j_k = 1}^n f(x_{j_1}, \dots, x_{j_k}).$$

With these conventions, we can list our main conjectures as follows: 

Conjecture 3.1 (Operator Inequality of Noncommutative Arithmetic and Geometric Means) 

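Let $A_1, \dots, A_n$ be a collection of positive semidefinite matrices. Then we conjecture that the following two inequalities always hold:

$$\left\|\mathbb{E}_{\mathrm{wo}}\left[\prod_{j=1}^k A_{i_j}\right]\right\| \le \left\|\mathbb{E}_{\mathrm{wr}}\left[\prod_{j=1}^k A_{i_j}\right]\right\| \tag{3.1}$$

$$\left\|\mathbb{E}_{\mathrm{wo}}\left[\prod_{j=1}^k A_{i_{k-j+1}} \prod_{j=1}^k A_{i_j}\right]\right\| \le \left\|\mathbb{E}_{\mathrm{wr}}\left[\prod_{j=1}^k A_{i_{k-j+1}} \prod_{j=1}^k A_{i_j}\right]\right\| \tag{3.2}$$

Note that in (3.1), we have $\mathbb{E}_{\mathrm{wr}}\left[\prod_{j=1}^k A_{i_j}\right] = \left(\frac{1}{n}\sum_{i=1}^n A_i\right)^k = (M_A)^k$.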
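The two expectations can be computed by direct enumeration for small pools, which also confirms the identity $\mathbb{E}_{\mathrm{wr}}[\prod_j A_{i_j}] = (M_A)^k$ numerically. The sketch below uses plain nested lists as matrices; the three $2 \times 2$ positive semidefinite matrices and all function names are illustrative assumptions.

```python
from itertools import permutations, product
from math import factorial

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def ordered_product(mats, idx):
    """Multiply by the first-indexed factor first, then left-multiply onward."""
    P = mats[idx[0]]
    for i in idx[1:]:
        P = matmul(mats[i], P)
    return P

def average(mat_list):
    m, d = len(mat_list), len(mat_list[0])
    return [[sum(M[i][j] for M in mat_list) / m for j in range(d)] for i in range(d)]

def expect_wo(mats, k):
    """Average over ordered k-tuples of distinct indices."""
    return average([ordered_product(mats, t) for t in permutations(range(len(mats)), k)])

def expect_wr(mats, k):
    """Average over all n^k index tuples."""
    return average([ordered_product(mats, t) for t in product(range(len(mats)), repeat=k)])

# Illustrative positive semidefinite matrices (assumption).
A = [[[2.0, 1.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 3.0]], [[1.0, -1.0], [-1.0, 2.0]]]
n, k = len(A), 2
count_wo = factorial(n) // factorial(n - k)  # number of ordered distinct tuples
count_wr = n ** k

M_A = average(A)
M_A_k = matmul(M_A, M_A)
E_wr = expect_wr(A, k)
E_wo = expect_wo(A, k)
print("E_wr:", E_wr)
print("M_A^k:", M_A_k)
print("E_wo:", E_wo)
```

Enumeration costs $n^k$ matrix products, so this approach is only meant for sanity checks on tiny instances.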



Assuming this conjecture holds, let us return to the analysis of the IGM (2.6). Assuming that $x_0 - x_\star$ is an arbitrary starting vector and that (3.2) holds, we have that each term in this summation is smaller for the without-replacement sampling model than for the with-replacement sampling model. In turn, we expect the without-replacement sampling implementation will return lower risk after one pass over the data-set. Similarly, for the randomized Kaczmarz iteration (2.7), Conjecture 3.1 implies that a without-replacement sample will have lower error after $k \le n$ iterations.

In the remainder of this document we provide several case studies illustrating that these noncommutative variants of the arithmetic-geometric mean inequality hold in a variety of settings, establishing along the way tools and techniques that may be useful for proving Conjecture 3.1 in full generality.

3.1 Two matrices and a search for the geometric mean 



Both of the inequalities (3.1) and (3.2) are true when $n = 2$. These inequalities follow from a well-established line of research in estimating the norms of products of matrices, started by the seminal work of Bhatia and Kittaneh (1990).

Proposition 3.2 Both (3.1) and (3.2) hold when $n = 2$.

Proof Let $A$ and $B$ be positive definite matrices. Both of our arithmetic-geometric mean inequalities follow from the stronger inequality

$$\|AB\| \le \left\|\left(\tfrac{1}{2}A + \tfrac{1}{2}B\right)^2\right\|. \tag{3.3}$$

This bound was proven by Bhatia and Kittaneh (2000). In particular, since

$$\left\|\tfrac{1}{2}AB + \tfrac{1}{2}BA\right\| \le \|AB\|,$$

(3.1) is immediate. For (3.2), note that

$$\mathbb{E}_{\mathrm{wo}}[A_{i_1} A_{i_2}^2 A_{i_1}] = \tfrac{1}{2}AB^2A + \tfrac{1}{2}BA^2B \tag{3.4}$$

and

$$\mathbb{E}_{\mathrm{wr}}[A_{i_1} A_{i_2}^2 A_{i_1}] = \tfrac{1}{4}A^4 + \tfrac{1}{4}AB^2A + \tfrac{1}{4}BA^2B + \tfrac{1}{4}B^4.$$

We can bound (3.4) by

$$\tfrac{1}{2}\left\|AB^2A + BA^2B\right\| \le \|AB^2A\| = \|AB\|^2 \le \left\|\tfrac{1}{2}A + \tfrac{1}{2}B\right\|^4 = \left\|\left(\tfrac{1}{2}A + \tfrac{1}{2}B\right)^4\right\|.$$

Here, the first inequality is the triangle inequality (using $\|BA^2B\| = \|AB\|^2 = \|AB^2A\|$) and the subsequent equality follows because the norm of $X^TX$ is equal to the squared norm of $X$. The second inequality is (3.3).

To complete the proof we show

$$X_L := \left(\tfrac{1}{2}A + \tfrac{1}{2}B\right)^4 \preceq \tfrac{1}{4}A^4 + \tfrac{1}{4}AB^2A + \tfrac{1}{4}BA^2B + \tfrac{1}{4}B^4 =: X_R$$

in the semidefinite ordering. But this follows by observing

$$X_R - X_L = \sum_{p \in [2]^2}\sum_{q \in [2]^2} Q(p,q)\, A_{p(2)} A_{p(1)} A_{q(1)} A_{q(2)} \tag{3.5}$$

where $A_1 = A$, $A_2 = B$, and $Q(p,q) = 3/16$ if $p = q$ and $-1/16$ otherwise. Since $p$ and $q$ both take 4 possible values, $Q$ is a $4 \times 4$ matrix equal to $\tfrac{1}{4}I - \tfrac{1}{16}\mathbf{1}\mathbf{1}^T$, which is positive semidefinite. This means that $X_R - X_L$ can be written as a nonnegative sum of products of the form $YY^T$. We conclude that $X_R - X_L$ must be positive semidefinite and hence $\|X_R\| \ge \|X_L\|$, completing the proof.¹ ■

Note that this proposition actually verifies a stronger statement: for two matrices, the arithmetic-geometric mean inequality holds even for deterministic orderings. We will discuss below how symmetrization is necessary for more than two matrices. In fact, considerably stronger inequalities hold for symmetrized products of two matrices. As a striking example, the symmetrized geometric mean actually precedes the square of the arithmetic mean in the positive definite order. Let $A$ and $B$ be positive semidefinite. Then we have

$$\left(\tfrac{1}{2}A + \tfrac{1}{2}B\right)^2 - \left(\tfrac{1}{2}AB + \tfrac{1}{2}BA\right) = \tfrac{1}{4}A^2 + \tfrac{1}{4}B^2 - \tfrac{1}{4}AB - \tfrac{1}{4}BA = \left(\tfrac{1}{2}A - \tfrac{1}{2}B\right)^2 \succeq 0.$$

This ordering breaks for 3 matrices as evinced by the counterexample
1 1
1 1 



1 1 
1 1 



The interested reader should consult Bhatia and Kittaneh (2008) for a comprehensive list of in- 
equalities concerning pairs of positive semidefinite matrices. 
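The two-matrix statements are simple to verify numerically on a concrete pair. The following sketch picks two illustrative $2 \times 2$ positive definite matrices (an assumption, not an example from the paper) and checks both the identity for the gap between the squared arithmetic mean and the symmetrized geometric mean and the resulting norm inequality. The helper names are hypothetical.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(2)) for j in range(2)] for i in range(2)]

def combo(cx, X, cy, Y):
    """Entrywise linear combination cx * X + cy * Y."""
    return [[cx * X[i][j] + cy * Y[i][j] for j in range(2)] for i in range(2)]

def sym_norm(S):
    """Operator norm of a symmetric 2x2 matrix via its two eigenvalues."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    return max(abs(mid - rad), abs(mid + rad))

A = [[2.0, 1.0], [1.0, 1.0]]  # illustrative positive definite matrix
B = [[1.0, 0.0], [0.0, 3.0]]  # illustrative positive definite matrix

geo = combo(0.5, matmul(A, B), 0.5, matmul(B, A))  # (AB + BA) / 2
half_sum = combo(0.5, A, 0.5, B)
arith_sq = matmul(half_sum, half_sum)              # ((A + B) / 2)^2
half_diff = combo(0.5, A, -0.5, B)

gap = combo(1.0, arith_sq, -1.0, geo)
gap_expected = matmul(half_diff, half_diff)        # ((A - B) / 2)^2
print("gap:", gap)
print("norms:", sym_norm(geo), "<=", sym_norm(arith_sq))
```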



¹ An explicit decomposition (3.5) into Hermitian squares was initially found using the software NCSOStools by Cafuta et al. (2011). This software finds decompositions of matrix polynomials into sums of Hermitian squares. Our argument was constructed after discovering this decomposition.

Unfortunately, these techniques are specialized to the case of two matrices, and no proof currently exists for the inequalities when $n \ge 3$. There have been a varied set of attempts to extend the noncommutative arithmetic-geometric mean inequalities to more than two matrices. Much of the work in this space has focused on how to properly define the geometric mean of a collection of positive semidefinite matrices. For instance, Ando et al. (2004) enumerate a list of properties desirable in any geometric mean, with one of the properties being that the geometric mean must precede the arithmetic mean in the positive-definite ordering. Ando et al. derive a geometric mean satisfying all of these properties, but the resulting mean in no way resembles the means of matrices discussed in this paper. Instead, their geometric mean is defined as a fixed point of a nonlinear map on matrix tuples. Bhatia and Holbrook (2006) and Bonnabel and Sepulchre (2009) propose geometric means based on geodesic flows on the Riemannian manifold of positive definite matrices; however, these means also do not correspond to the averaged matrix products that we study in this paper.



3.2 When is it not necessary to symmetrize the order? 



When the matrices commute, Conjecture 3.1 is a consequence of the standard arithmetic-geometric mean inequality (more precisely, a consequence of Maclaurin's inequalities).

Theorem 3.3 (Maclaurin's Inequalities) Let $x_1, \dots, x_n$ be positive scalars. Let

$$s_k = \binom{n}{k}^{-1}\sum_{\substack{\Omega \subset [n] \\ |\Omega| = k}}\ \prod_{i \in \Omega} x_i$$

be the normalized $k$th symmetric sum. Then we have

$$s_1 \ge \sqrt{s_2} \ge \cdots \ge \sqrt[n-1]{s_{n-1}} \ge \sqrt[n]{s_n}.$$

Note that $s_1 \ge \sqrt[n]{s_n}$ is the standard form of the arithmetic-geometric mean inequality. See Hardy et al. (1952) for a discussion and proof of this chain of inequalities.



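To see that these inequalities immediately imply Conjecture 3.1 when the matrices $A_i$ are mutually commutative, note first that when $d = 1$, we have

$$\mathbb{E}_{\mathrm{wo}}\left[\prod_{i=1}^k x_{j_i}\right] = \binom{n}{k}^{-1}\sum_{\substack{\Omega \subset [n] \\ |\Omega| = k}}\ \prod_{i \in \Omega} x_i = s_k \le s_1^k = \left(\frac{1}{n}\sum_{i=1}^n x_i\right)^k = \mathbb{E}_{\mathrm{wr}}\left[\prod_{i=1}^k x_{j_i}\right].$$

The higher dimensional analogs follow similarly. If all of the $A_i$ commute, then the matrices are mutually diagonalizable. That is, we can write $A_i = U\Lambda_i U^T$ where $U$ is an orthogonal matrix, and the $\Lambda_i = \mathrm{diag}(\lambda_1^{(i)}, \dots, \lambda_d^{(i)})$ are all diagonal matrices of the eigenvalues of $A_i$. Then, applying the scalar inequality to each diagonal entry, we have

$$\mathbb{E}_{\mathrm{wo}}\left[\prod_{i=1}^k \Lambda_{j_i}\right] \preceq \mathbb{E}_{\mathrm{wr}}\left[\prod_{i=1}^k \Lambda_{j_i}\right]$$

and we also have

$$\left\|\mathbb{E}_{\mathrm{wo}}\left[\prod_{i=1}^k A_{j_i}\right]\right\| = \left\|\mathbb{E}_{\mathrm{wo}}\left[\prod_{i=1}^k \Lambda_{j_i}\right]\right\|$$

and

$$\left\|\mathbb{E}_{\mathrm{wr}}\left[\prod_{i=1}^k A_{j_i}\right]\right\| = \left\|\mathbb{E}_{\mathrm{wr}}\left[\prod_{i=1}^k \Lambda_{j_i}\right]\right\|,$$

verifying our conjecture (the argument for (3.2) is identical with the $\Lambda_i$ replaced by $\Lambda_i^2$). In fact, in this case, any order of the matrix products will satisfy the desired arithmetic-geometric mean inequalities.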
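In the scalar case, both Maclaurin's chain and the identification of the two expectations with $s_k$ and $s_1^k$ can be checked by enumeration. The sketch below uses an illustrative list of positive scalars (an assumption) and hypothetical helper names.

```python
from itertools import combinations, permutations, product
from math import comb, prod

def normalized_symmetric_sum(xs, k):
    """s_k: the average of prod_{i in Omega} x_i over all k-subsets Omega."""
    return sum(prod(c) for c in combinations(xs, k)) / comb(len(xs), k)

def mean_product(tuples):
    vals = [prod(t) for t in tuples]
    return sum(vals) / len(vals)

xs = [0.5, 1.0, 2.0, 4.0]  # illustrative positive scalars (assumption)
n, k = len(xs), 2

# Maclaurin's chain: s_1 >= s_2^(1/2) >= ... >= s_n^(1/n).
chain = [normalized_symmetric_sum(xs, j) ** (1.0 / j) for j in range(1, n + 1)]

# Without replacement: ordered pairs of distinct items. With replacement: all pairs.
e_wo = mean_product(list(permutations(xs, k)))
e_wr = mean_product(list(product(xs, repeat=k)))
print("chain:", chain)
print("E_wo =", e_wo, " s_k =", normalized_symmetric_sum(xs, k))
print("E_wr =", e_wr, " s_1^k =", normalized_symmetric_sum(xs, 1) ** k)
```

Averaging over ordered tuples of distinct indices gives the same value as averaging over unordered subsets because scalar products do not depend on order, which is exactly what fails in the noncommutative setting.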

3.3 When is it necessary to symmetrize the order? 

In contrast, symmetrizing over the order of the product is necessary for noncommutative operators. The following example, communicated to us by Aram Harrow, provides deterministic without-replacement orderings that have exponentially larger norm than the with-replacement expectation. Let $\omega_n = \pi/n$. For $n \ge 3$, define the collection of vectors

$$a_{k;n} = \begin{bmatrix} \cos(k\omega_n) \\ \sin(k\omega_n) \end{bmatrix}. \tag{3.6}$$

Note that all of the $a_{k;n}$ have norm 1 and, for $1 \le k < n$, $\langle a_{k;n}, a_{k+1;n}\rangle = \cos(\omega_n)$. The matrices $A_k := a_{k;n}a_{k;n}^T$ are all positive semidefinite for $1 \le k \le n$, and we have the identity

$$\frac{1}{n}\sum_{k=1}^n A_k = \frac{1}{2}I. \tag{3.7}$$

Any set of unit vectors satisfying (3.7) is called a normalized tight frame, and the vectors (3.6) form a harmonic frame due to their trigonometric origin (see Hassibi et al., 2001; Goyal et al., 2001). The product of the $A_i$ is given by

$$\prod_{i=1}^k A_i = a_{k;n}a_{1;n}^T \prod_{j=1}^{k-1}\langle a_{j;n}, a_{j+1;n}\rangle = a_{k;n}a_{1;n}^T \cos^{k-1}(\omega_n),$$

and hence

$$\left\|\prod_{i=1}^k A_i\right\| = \cos^{k-1}(\omega_n) = 2^k\cos^{k-1}(\omega_n)\left\|\frac{1}{n}\sum_{k=1}^n A_k\right\|^k.$$

Therefore, the $k$th power of the norm of the arithmetic mean is less than the norm of the deterministically ordered matrix product for all $n \ge 3$.
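The harmonic frame computation can be reproduced numerically. The following sketch builds the vectors (3.6) for an illustrative choice $n = 6$, checks the tight-frame identity (3.7), and compares the norm of the deterministically ordered product with $\|M_A\|^n$. The helper names are hypothetical and the matrices are plain nested lists.

```python
import math

def frame_vector(k, n):
    w = math.pi / n
    return [math.cos(k * w), math.sin(k * w)]

def outer(a):
    return [[a[i] * a[j] for j in range(2)] for i in range(2)]

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(2)) for j in range(2)] for i in range(2)]

def op_norm(M):
    """Operator norm of a 2x2 matrix: square root of the top eigenvalue of M^T M."""
    g00 = M[0][0] ** 2 + M[1][0] ** 2
    g11 = M[0][1] ** 2 + M[1][1] ** 2
    g01 = M[0][0] * M[0][1] + M[1][0] * M[1][1]
    top = (g00 + g11) / 2.0 + math.hypot((g00 - g11) / 2.0, g01)
    return math.sqrt(top)

n = 6  # illustrative frame size (assumption)
A = [outer(frame_vector(k, n)) for k in range(1, n + 1)]
mean_A = [[sum(M[i][j] for M in A) / n for j in range(2)] for i in range(2)]

P = A[0]
for M in A[1:]:
    P = matmul(M, P)  # builds A_n ... A_2 A_1

ordered_norm = op_norm(P)
arithmetic_norm_power = op_norm(mean_A) ** n
print("arithmetic mean:", mean_A)
print("ordered product norm:", ordered_norm)
print("||M_A||^n:", arithmetic_norm_power)
```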

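It turns out that this harmonic frame example is in some sense the worst case. The following proposition shows that the geometric mean is always within a factor of $d^k$ of the arithmetic mean for any ordering of the without-replacement matrix product.

Proposition 3.4 Let $A_1, \dots, A_n$ be $d \times d$ positive semidefinite matrices. Then

$$\left\|\mathbb{E}_{\mathrm{wo}}\left[\prod_{i=1}^k A_{j_i}\right]\right\| \le d^k \left\|\frac{1}{n}\sum_{i=1}^n A_i\right\|^k.$$

Proof If we sample $j_1, \dots, j_k$ uniformly without replacement from $[n]$, then we have

$$\left\|\mathbb{E}_{\mathrm{wo}}\left[\prod_{i=1}^k A_{j_i}\right]\right\| \le \mathbb{E}_{\mathrm{wo}}\left[\left\|\prod_{i=1}^k A_{j_i}\right\|\right] \le \mathbb{E}_{\mathrm{wo}}\left[\prod_{i=1}^k \|A_{j_i}\|\right] \le \mathbb{E}_{\mathrm{wo}}\left[\prod_{i=1}^k \mathrm{trace}(A_{j_i})\right] \le \left(\frac{1}{n}\sum_{i=1}^n \mathrm{trace}(A_i)\right)^k = \mathrm{trace}\left(\frac{1}{n}\sum_{i=1}^n A_i\right)^k \le d^k\left\|\frac{1}{n}\sum_{i=1}^n A_i\right\|^k.$$

Here, the first inequality follows from the triangle inequality. The second, because the operator norm is submultiplicative. The third inequality follows because the trace dominates the operator norm of a positive semidefinite matrix. The fourth inequality is Maclaurin's, and the subsequent equality is the linearity of the trace. The final inequality follows because the trace of a $d \times d$ positive semidefinite matrix is upper bounded by $d$ times the operator norm. ■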

Note that this worst-case bound holds for deterministic orders of matrix products as well. Once we apply the submultiplicative property of the operator norm, all of the non-commutativity is washed out of the problem. Examples of deterministic matrix products saturating this upper bound can be constructed in higher dimensions using frames. If $d$ is even, set



and for odd d 



fk+l 



fk+1 



d ' 



T T 



J_ ~ T 

>/2' 



T 

' a (d-l)k;n 



' a (d-l)k;n 



for k = 0, 1, . 



for k = 0, 1, 



, n 



1 



(3. 



(3.9) 



Then one can verify again using standard trigonometric identities that 

1 n 1 



k=l 



and that the inner products of adjacent fi are 

t 2 cos ((d/2 - IK) sin(( ^Lt\ )aJ " } " I cos K) d even 

d odd 



fi fi+l 



| cos {{d - 1) /%j n ) Si *((<*+W^») 



:(rf+i)A 

sin(w n ) 



1 



These inner products are approximately 1 — - ^ 2 ^ for large n. Thus, each of these cases violate 
the arithmetic-geometric mean inequality for the order (1, 2, . . . , k) by a factor of approximately d k 
provided n > d. 

At first glance, the harmonic frames example appears to cast doubt on the validity of Conjecture 3.1. However, after symmetrizing over the symmetric group, we can show that the $d = 2$ harmonic frames do obey (3.1).

Theorem 3.5 Let A(n) = 2^3 



1 -n/2 + 1/2 -n/2 
1/2 -n + 1 : " 



l);n 



. With the a^-n defined in (3.6), 
\{n)2- n I, and 1 > A(n) = ©(n" 1 ) . 



crgS 1 ,! i=l 
This theorem additionally verifies that there is an asymptotic gap between the arithmetic and geometric means of the harmonic frames example after symmetrization. We include a full proof of this result in Appendix B. The proof treats the norm variationally using the identity that, for a positive semidefinite matrix $X$, $\|X\|$ is the maximum of $v^TXv$ over all unit vectors $v$. Our computation then reduces to effectively computing a Fourier transform of the function of $v$ in an appropriately defined finite group. We show that the Fourier coefficients can be viewed as enumerating sets, and we compute them exactly using generating functions.

The combinatorial argument that we use to prove Theorem 3.5 is very specialized. To provide a broader set of examples, we now turn to show that Conjecture 3.1 does in fact hold for many classes of random matrices.



4 Random matrices 

In this section, we show that if $A_1, \dots, A_n$ are generated i.i.d. from certain distributions, then Conjecture 3.1 holds in expectation with respect to the $A_i$. Section 4.1 assumes that $A_i = Z_iZ_i^T$ where the $Z_i$ have independent entries, identically sampled from some symmetric distribution. In Section 4.2 we explore when the matrices $A_i$ are random rank-one perturbations of the identity as was the case in the IGM and Kaczmarz examples.

4.1 Random matrices satisfy the noncommutative arithmetic-geometric mean 
inequality 

In this section, we prove the following 

Proposition 4.1 For each $i = 1, \dots, n$, suppose $A_i = Z_iZ_i^T$ with $Z_i$ a $d \times r$ random matrix whose entries are i.i.d. samples from some symmetric distribution. Then Conjecture 3.1 holds in expectation.

Proof Suppose the entries of each $Z_i$ have finite variance $\sigma^2$ (the proposition would otherwise be vacuous if we assumed infinite variance). Let the $(a, b)$ entry of $Z_i$ be denoted by $Z^{(i)}_{a,b}$. Also, denote by $W$ the matrix with all of the $Z_i$ stacked as columns, normalized so that its entries have unit variance: $W = \sigma^{-1}[Z_1, \dots, Z_n]$.



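Let's first prove that (3.1) holds in expectation for these matrices. First, consider the without-replacement samples, which are considerably easier to analyze. Let $(j_1, \dots, j_k)$ be a without-replacement sample from $[n]$. Then, because the $A_{j_i}$ are independent and $\mathbb{E}[A_i] = r\sigma^2 I_d$,

$$\mathbb{E}\left[\prod_{i=1}^k A_{j_i}\right] = \prod_{i=1}^k \mathbb{E}[A_{j_i}] = r^k\sigma^{2k} I_d.$$

For the arithmetic mean, we can compute

$$\begin{aligned}
r^{-k}\sigma^{-2k}\left\|\mathbb{E}\left[\left(\frac{1}{n}\sum_{i=1}^n A_i\right)^k\right]\right\|
&\ge r^{-k}\sigma^{-2k} d^{-1}\,\mathrm{trace}\left(\mathbb{E}\left[\left(\frac{1}{n}\sum_{i=1}^n A_i\right)^k\right]\right) \\
&= \mathbb{E}\left[d^{-1}\,\mathrm{trace}\left(\left(\frac{1}{nr\sigma^2}\sum_{i=1}^n A_i\right)^k\right)\right]
= \mathbb{E}\left[d^{-1}\,\mathrm{trace}\left(\left(\frac{1}{nr}WW^T\right)^k\right)\right] \qquad (4.1) \\
&= d^{-1}(nr)^{-k}\sum_{a_1, \dots, a_k = 1}^d\ \sum_{b_1, \dots, b_k = 1}^{nr} \mathbb{E}\left[W_{a_1,b_1}W_{a_2,b_1}W_{a_2,b_2}W_{a_3,b_2}\cdots W_{a_k,b_k}W_{a_1,b_k}\right] \\
&= (nr)^{-k}\sum_{a_2, \dots, a_k = 1}^d\ \sum_{b_1, \dots, b_k = 1}^{nr} \mathbb{E}\left[W_{1,b_1}W_{a_2,b_1}W_{a_2,b_2}W_{a_3,b_2}\cdots W_{a_k,b_k}W_{1,b_k}\right]. \qquad (4.2)
\end{aligned}$$

Note that since the $W_{ij}$ are i.i.d., symmetric random variables, each term in this sum is zero if it contains an odd power of $W_{ij}$ for some $i$ and $j$. If all of the powers in a summand are even, its expected value is bounded below by 1. A simple lower bound for this final term (4.2) thus looks only at the contribution from when all of the indices $a_j$ are set equal to 1:

$$(nr)^{-k}\sum_{b_1, \dots, b_k = 1}^{nr}\mathbb{E}\left[W_{1,b_1}^2 W_{1,b_2}^2\cdots W_{1,b_k}^2\right] = \mathbb{E}\left[\left(\frac{1}{nr}\sum_{b=1}^{nr} W_{1,b}^2\right)^k\right] \ge \left(\mathbb{E}\left[\frac{1}{nr}\sum_{b=1}^{nr} W_{1,b}^2\right]\right)^k = 1.$$

Here the inequality is Jensen's. This calculation proves (3.1) for our family of random matrices. That is, we have demonstrated that the expected value of the with-replacement sample has greater norm than the expected value of the without-replacement sample.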

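To verify that (3.2) holds for our random matrix model, we first record the following property about the moments of the entries of the $A_i$. Let $\xi := \mathbb{E}[(Z^{(i)}_{a,b})^4]^{1/4}$. Then, writing $A$ for any one of the $A_i$, we can verify by direct calculation that

$$\mathbb{E}[A_{i_1,j_1}A_{i_2,j_2}] = \begin{cases} r(r-1)\sigma^4 + r\xi^4 & i_1 = j_1 = i_2 = j_2 \\ r^2\sigma^4 & i_1 = j_1,\ i_2 = j_2 \text{ and } i_1 \ne i_2 \\ r\sigma^4 & \{i_1, j_1\} = \{i_2, j_2\} \text{ and } i_1 \ne j_1 \\ 0 & \text{otherwise.} \end{cases} \tag{4.3}$$

A consequence of this lemma is that $\mathbb{E}[A_i^2] = (r(r + d - 2)\sigma^4 + r\xi^4) I_d$, since each diagonal entry of $A_i^2$ sums one term of the first kind and $d - 1$ terms of the third kind. Using this identity, we can set $\zeta := r(r + d - 2)\sigma^4 + r\xi^4$ and we then have, for distinct indices $i_1, \dots, i_k$,

$$\mathbb{E}[V(i_1, \dots, i_k)] = \mathbb{E}[A_{i_k}\cdots A_{i_2}A_{i_1}^2A_{i_2}\cdots A_{i_k}] = \zeta\,\mathbb{E}[A_{i_k}\cdots A_{i_3}A_{i_2}^2A_{i_3}\cdots A_{i_k}] = \cdots = \zeta^k I_d.$$

We compute this identity in a second way that describes its combinatorics more explicitly, which we will use to derive our lower bound.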



E[V u , v (h,...,i k )\ =E 



k-l 



I (*2fc-j+l) 



3=2 

k-l 



1 fefc-j + l) 



P e[d] 2k 



3=2 
The second equality uses linearity coupled with the fact that $i_1, \dots, i_k$ are distinct, so the expectation of a product of entries drawn from distinct matrices factors into a product of expectations, since elements from distinct matrices are independent. Many of the terms in this
sum contain odd powers which are zero. Using the fact that $A = A^T$, we see that all terms that
are non-zero must contain only products of two forms: A 2 IU or A^ v - Then, we can write the sum: 

k 

V u , v (i 1: ...,i k )= £ E[(A^) 2 ]nE[(^i«) 3 ] 

P e[d] k 3=2 

Now consider the case that some index may be repeated (i.e., there exist $j$, $l$ such that $i_j = i_l$ for $j \ne l$). The key observation is the following. Let $w$ be a real-valued random variable with a finite second moment. Then,

$$\mathbb{E}[w^{2p}] \ge \mathbb{E}[w^2]^p \quad\text{for } p = 0, 1, \dots, n \tag{4.4}$$

with equality guaranteed only for $p = 0, 1$. This is Jensen's inequality applied to $x^p$ for $x \ge 0$ (since $w$ is real then $w^2$ is nonnegative, and $x^p$ is convex on $[0, \infty)$ for $p = 0, 1, 2, \dots$). To verify the inequality, let $n_i$ be the number of times index $i$ is repeated and observe

fc-i 

l(*2JV) 



E[V u Jh,...,i k )]=E 



■ A {il) A' 

pm 2 



n > 

7=2 

A { f. 



Pfe-l)>P(»j) P(«2fe-j),Pfefc-j + l) J 



> 



pe[d]* 



k 



P0'-l):P0') y 



(4.5) 



(4.6) 



pe[d] k 



3=2 



Here (4.5) follows from (4.3), since all terms are non-negative, and (4.6) is a repeated application of (4.4). The final expression is precisely equal to the without-replacement average. Now, the with-replacement average can be bounded as

k k 
II Ai k-j+i II A i 



3=1 



3=1 



> ^E v 
a 



trace 



i=i 



E wr [Ui,i(ii, 



Since each term in this last expression exceeds the without-replacement expectation, this completes 
the proof. ■ 



The arguments used to prove Proposition 4.1 grossly undercount the number of terms that contribute to the expectation. Bounds on the quantity (4.1) commonly arise in the theory of random matrices (see the survey by Bai, 1999 for more details and an extensive list of references). Indeed, if we let $d = \delta n$ and assume that the $W_{ij}$ have bounded fourth moment, we have that (4.1) tends to $(1 + \sqrt{\delta})^{2k}$ almost surely as $n \to \infty$. That is, the gap between the with- and without-replacement sampling grows exponentially with $k$ in this scaling regime. Similarly, there is an asymptotic, exponential gap between the with- and without-replacement expectations in (3.2). Observe that (4.4) is strict for $p \ge 2$ for $\chi$-squared random variables. Thus, for Wishart matrices, if there is even a single repeated value, i.e., $i_j = i_l$ for $j \ne l$, inequality (4.6) is strict. In Appendix A we analyze the case where the $Z_i$ are Gaussian (and hence the $A_i$ are Wishart) and demonstrate that the ratio of
lGk 



the expectation is bounded below by re 4fc ( fe+1 ) 



e 2 r(r+d+l) 
4.2 Random vectors and the incremental gradient method



We can also use a random analysis to demonstrate that for the least-squares problem (2.4), without- 
replacement sampling outperforms with-replacement sampling if the data is randomly generated. 

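Let's look at one step of the recursion (2.5) and assume that the $a_i$ are sampled i.i.d. from
some distribution. Assume that the moments $\Lambda := \mathbb{E}[a_i a_i^T]$ and $\Delta := \mathbb{E}[\|a_i\|^2 a_i a_i^T]$ exist. Then
we see immediately that

$$\mathbb{E}_{\mathrm{wo}}[\|x_k - x_\star\|^2] = \mathbb{E}_{\mathrm{wo}}\left[(x_{k-1} - x_\star)^T (I - 2\gamma\Lambda + \gamma^2\Delta)(x_{k-1} - x_\star)\right] + \rho^2\gamma^2\,\mathrm{trace}(\Lambda)$$

because $a_{i_k}$ is chosen independently from $(a_{i_1}, \ldots, a_{i_{k-1}})$. On the other hand, in the with-replacement
model, we have

$$\mathbb{E}_{\mathrm{wr}}[\|x_k - x_\star\|^2] = \mathbb{E}_{\mathrm{wr}}\left[(x_{k-1} - x_\star)^T (I - 2\gamma\Lambda_n + \gamma^2\Delta_n)(x_{k-1} - x_\star)\right] + \rho^2\gamma^2\,\mathrm{trace}(\Lambda)$$

where

$$\Lambda_n := \frac{1}{n}\sum_{i=1}^{n} a_i a_i^T \quad\text{and}\quad \Delta_n := \frac{1}{n}\sum_{i=1}^{n} \|a_i\|^2 a_i a_i^T .$$

In this case, we cannot distribute the expected value because the vector $x_{k-1} - x_\star$ depends on all $a_i$
for $1 \le i \le n$. To get a flavor for how these differ, consider the conditional expectation

$$\mathbb{E}_{\mathrm{wr}}[\|x_k - x_\star\|^2 \mid \{a_i\}] \le \left(1 - 2\gamma\lambda_{\min}(\Lambda_n) + \gamma^2\lambda_{\max}(\Delta_n)\right) \mathbb{E}_{\mathrm{wr}}[\|x_{k-1} - x_\star\|^2 \mid \{a_i\}] + \rho^2\gamma^2\,\mathrm{trace}(\Lambda_n) .$$

Similarly,

$$\mathbb{E}_{\mathrm{wo}}[\|x_k - x_\star\|^2] \le \left(1 - 2\gamma\lambda_{\min}(\Lambda) + \gamma^2\lambda_{\max}(\Delta)\right) \mathbb{E}_{\mathrm{wo}}[\|x_{k-1} - x_\star\|^2] + \rho^2\gamma^2\,\mathrm{trace}(\Lambda) .$$

Expanding out these recursions, we have

$$\mathbb{E}_{\mathrm{wo}}[\|x_k - x_\star\|^2] \le \left(1 - 2\gamma\lambda_{\min}(\Lambda) + \gamma^2\lambda_{\max}(\Delta)\right)^k \mathbb{E}_{\mathrm{wo}}[\|x_0 - x_\star\|^2] + \frac{\rho^2\gamma\,\mathrm{trace}(\Lambda)}{2\lambda_{\min}(\Lambda) - \gamma\lambda_{\max}(\Delta)}$$

$$\mathbb{E}_{\mathrm{wr}}[\|x_k - x_\star\|^2 \mid \{a_i\}] \le \left(1 - 2\gamma\lambda_{\min}(\Lambda_n) + \gamma^2\lambda_{\max}(\Delta_n)\right)^k \mathbb{E}_{\mathrm{wr}}[\|x_0 - x_\star\|^2 \mid \{a_i\}] + \frac{\rho^2\gamma\,\mathrm{trace}(\Lambda_n)}{2\lambda_{\min}(\Lambda_n) - \gamma\lambda_{\max}(\Delta_n)}$$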
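For completeness, the unrolling step behind these two bounds can be written out. This is a short worked derivation rather than new material: $e_k$, $c$ and $b$ are shorthand introduced here (the squared error, the contraction factor and the additive noise term of the one-step bound), $\Lambda$ and $\Delta$ denote the second- and fourth-order moment matrices of the $a_i$ defined above, and we assume the step size is small enough that $0 \le c < 1$.

```latex
\begin{align*}
e_k \;\le\; c\, e_{k-1} + b
    \;&\le\; c^{k} e_0 + b \sum_{j=0}^{k-1} c^{j}
    \;\le\; c^{k} e_0 + \frac{b}{1-c}, \\[4pt]
\text{with } c = 1 - 2\gamma\lambda_{\min}(\Lambda) + \gamma^{2}\lambda_{\max}(\Delta),
    \quad & b = \rho^{2}\gamma^{2}\,\mathrm{trace}(\Lambda): \\[4pt]
\frac{b}{1-c}
    \;&=\; \frac{\rho^{2}\gamma^{2}\,\mathrm{trace}(\Lambda)}
                {2\gamma\lambda_{\min}(\Lambda) - \gamma^{2}\lambda_{\max}(\Delta)}
    \;=\; \frac{\rho^{2}\gamma\,\mathrm{trace}(\Lambda)}
                {2\lambda_{\min}(\Lambda) - \gamma\lambda_{\max}(\Delta)} .
\end{align*}
```

The same computation with the sample matrices in place of the population moments gives the conditional with-replacement bound.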



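Now, since $a_i a_i^T$ is positive semidefinite and since $\lambda_{\min}$ is concave on Hermitian matrices,
we have by Jensen's inequality that

$$\mathbb{E}[\lambda_{\min}(\Lambda_n)] = \mathbb{E}\left[\lambda_{\min}\left(\frac{1}{n}\sum_{i=1}^{n} a_i a_i^T\right)\right] \le \lambda_{\min}\left(\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} a_i a_i^T\right]\right) = \lambda_{\min}(\Lambda)$$

and, since $\lambda_{\max}$ is convex for symmetric matrices,

$$\mathbb{E}[\lambda_{\max}(\Delta_n)] = \mathbb{E}\left[\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n} \|a_i\|^2 a_i a_i^T\right)\right] \ge \lambda_{\max}\left(\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \|a_i\|^2 a_i a_i^T\right]\right) = \lambda_{\max}(\Delta) .$$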
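The direction of the two Jensen inequalities is easy to check on a toy case. The sketch below is an illustration of ours, not part of the paper: it takes a hypothetical distribution that picks one of two orthogonal unit vectors in $\mathbb{R}^2$ with equal probability, and compares the average of the extreme eigenvalues of $a a^T$ with the extreme eigenvalues of the averaged matrix. All function names are made up for the example.

```python
import math

def sym2_eigs(m):
    """Return (lambda_min, lambda_max) of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = m
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius

def outer(v):
    """Rank-one matrix v v^T for a 2-vector v."""
    return [[v[0] * v[0], v[0] * v[1]], [v[1] * v[0], v[1] * v[1]]]

def average(mats):
    """Entrywise average of a list of 2x2 matrices."""
    n = len(mats)
    return [[sum(m[i][j] for m in mats) / n for j in range(2)] for i in range(2)]

# Hypothetical distribution: a = e_1 or a = e_2, each with probability 1/2.
samples = [outer([1.0, 0.0]), outer([0.0, 1.0])]
mean_matrix = average(samples)  # plays the role of E[a a^T]

# Averages of the extreme eigenvalues of the individual rank-one matrices.
avg_lam_min = sum(sym2_eigs(m)[0] for m in samples) / len(samples)
avg_lam_max = sum(sym2_eigs(m)[1] for m in samples) / len(samples)

# Extreme eigenvalues of the averaged matrix.
lam_min_of_mean, lam_max_of_mean = sym2_eigs(mean_matrix)

print(avg_lam_min, "<=", lam_min_of_mean)  # concavity of lambda_min
print(avg_lam_max, ">=", lam_max_of_mean)  # convexity of lambda_max
```

Because the vectors have unit norm, $\|a\|^2 a a^T = a a^T$ here, so the same pair of matrices illustrates both inequalities.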



This means that the with-replacement upper bound is worse than the without-replacement estimate
with reasonably high probability on most models of $a_i$. Under mild conditions on $a_i$ (including
Gaussianity, bounded entries, subgaussian moments, or bounded Orlicz norm), we can estimate tail
bounds for the eigenvalues of $\Lambda_n$ and $\Delta_n$ (by applying the techniques of Tropp, 2011, for example).
These large deviation inequalities provide quantitative estimates of the gap between with- and
without-replacement sampling for the least mean squares and randomized Kaczmarz algorithms.
A similar, but more tedious, analysis would reveal that with-replacement sampling fares worse with
diminishing step sizes as well.
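A small simulation makes the two sampling models concrete. The sketch below is ours and rests on a deliberately simple assumption: the rows of the system are the standard basis of $\mathbb{R}^4$, so that the outcome can be checked by hand. It runs one pass of the randomized Kaczmarz projection step with indices drawn without replacement (a random permutation) and with replacement (independent uniform draws). The helper names and the seed are arbitrary.

```python
import random

def kaczmarz_step(x, a, b):
    """Project x onto the hyperplane {z : <a, z> = b}."""
    residual = b - sum(ai * xi for ai, xi in zip(a, x))
    scale = residual / sum(ai * ai for ai in a)
    return [xi + scale * ai for xi, ai in zip(x, a)]

def run(rows, rhs, order):
    """Apply Kaczmarz projections in the given index order, starting from zero."""
    x = [0.0] * len(rows[0])
    for i in order:
        x = kaczmarz_step(x, rows[i], rhs[i])
    return x

def sq_error(x, x_star):
    """Squared Euclidean distance between x and x_star."""
    return sum((xi - si) ** 2 for xi, si in zip(x, x_star))

# Toy consistent system (an assumption for illustration): orthonormal rows.
d = 4
rows = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
x_star = [1.0, -2.0, 3.0, 0.5]
rhs = [sum(a * s for a, s in zip(row, x_star)) for row in rows]

rng = random.Random(0)
without_repl = rng.sample(range(d), d)            # one pass over a random permutation
with_repl = [rng.randrange(d) for _ in range(d)]  # d independent uniform draws

err_wo = sq_error(run(rows, rhs, without_repl), x_star)
err_wr = sq_error(run(rows, rhs, with_repl), x_star)
print("without replacement:", err_wo)
print("with replacement:   ", err_wr)
```

With orthonormal rows, any permutation recovers $x_\star$ exactly after one pass, whereas $d$ independent draws miss at least one row with probability $1 - d!/d^d$. Less structured data shows a gap in expectation rather than in every run.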



5 Numerical Evidence 



As described in the introduction, there is substantial numerical evidence that with-replacement
sampling underperforms without-replacement sampling in many randomized algorithms. We invite
the interested reader to consult Bottou (2009); Recht and Ré (2011); Feng et al. (2012), and many
other articles in the machine learning literature to substantiate these empirical claims. However,
for completeness, we provide a few examples demonstrating the gap for the examples in Section 2.
In Figure 1 we display six comparisons of with- and without-replacement sampling. In the first
row, we show the discrepancy when running the randomized Kaczmarz algorithm when the rows
of $\Phi$ are the $d$-dimensional harmonic frame defined by (3.8). In the second row, we plot the results for incremental
gradient descent with $\rho = 0.01$ in the same harmonic frames example. Finally, the third row plots
performance when the rows of $\Phi$ are generated i.i.d. from Haar measure on the sphere. In all three
cases, without-replacement sampling converges faster than with-replacement sampling, and when
$d$ and $n$ are close, the convergence rate is considerably faster.



6 Discussion and open problems 

While i.i.d. matrices are of significant importance in machine learning, the major piece of open work
is proving Conjecture 3.1 for all positive semidefinite matrix tuples or finding a counterexample for
either of the assertions. As demonstrated by the harmonic frames example, symmetrized products
of deterministic matrices quickly become tedious and difficult to study. Some sort of combinatorial
structure might need to be exploited for a short proof to arise in general. It remains to be seen
whether the sort of combinatorics employed in proving Theorem 3.5 generalizes beyond this particular
example, but we expect these techniques will be useful in future studies of Conjecture 3.1. In
particular, it would be interesting to see if we could reduce the proof of the conjecture to verifying
the conjecture on frames that arise as the orbit of the representation of some finite group. These
frames have been fully classified by Hassibi et al. (2001), and this would reduce Conjecture 3.1 to a
finite list of cases.



A further conjecture and its consequences. The generalization of (3.3) to $n \ge 3$ asserts a
stronger version of (3.1):



k 



< 



n 

i=i 



2 k 



(6.1) 



Certainly, (3.1) follows from (6.1) by Jensen's inequality and the triangle inequality. Moreover, using
the same argument we used in proving Proposition 3.2, (3.2) also follows from (6.1). When $n \ge 3$,
is it the case that (6.1) holds? It could be that for general matrices, it is easier to analyze (6.1)
rather than (3.2) because the right hand side is in terms of the arithmetic mean, rather than the
more complicated quadratic matrix products in (3.2).

Figure 1: Comparison of with- and without-replacement sampling on the examples from
Section 2. In the first column, $\Phi$ is 100 x 105. In the second column, it is 100 x 200. The first
row is the randomized Kaczmarz algorithm with $\Phi$ being the harmonic frame defined in (3.8).
The second row is running incremental gradient from Section 2.2 with $\rho = 0.01$ where $\Phi$ is
also from the harmonic frame of (3.8). The final row is running Kaczmarz again, this time
with vectors generated uniformly from Haar measure on the sphere in $\mathbb{R}^d$.



Effect of biased orderings. Another possible technique for solving incremental algorithms is to
choose the ordering of the increments that best reduces the cost function. In terms of matrices, can we
find the ordering of the matrices $A_i$ that achieves the minimum norm? At first glance this seems
daunting. Suppose $A_i = a_i a_i^T$ where the $a_i$ are all unit vectors. Then for $\sigma \in S_n$,

$$\left\| \prod_{i=1}^{n} A_{\sigma(i)} \right\| = \prod_{i=1}^{n-1} \left| \langle a_{\sigma(i)}, a_{\sigma(i+1)} \rangle \right| ,$$

and minimizing this expression with respect to $\sigma$ amounts to finding the minimum weight traveling
salesman path in the graph with weights $\log |\langle a_i, a_j \rangle|$. Are there simple heuristics that can get
within a small constant of the optimal tour for these graphs? How do greedy heuristics fare? This
sort of approach was explored with some success for the Kaczmarz method by Eldar and Needell
(2011).
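For very small $n$ the ordering question can be answered by brute force, which gives a baseline against which heuristics could be compared. The sketch below is an illustration of ours: it enumerates all orderings of three hypothetical unit vectors in $\mathbb{R}^2$ and minimizes the product of $|\langle a_i, a_j \rangle|$ over consecutive pairs. It works with the product directly rather than with sums of logarithms, to avoid taking the logarithm of a zero inner product.

```python
import itertools
import math

def path_weight(vectors, order):
    """Product of |<a_i, a_j>| over consecutive pairs (i, j) in the ordering."""
    weight = 1.0
    for i, j in zip(order, order[1:]):
        weight *= abs(sum(p * q for p, q in zip(vectors[i], vectors[j])))
    return weight

def best_ordering(vectors):
    """Exhaustive search over all n! orderings (only feasible for tiny n)."""
    return min(
        itertools.permutations(range(len(vectors))),
        key=lambda order: path_weight(vectors, order),
    )

# Hypothetical unit vectors at angles 0, 60 and 90 degrees.
angles = [0.0, math.pi / 3, math.pi / 2]
vectors = [(math.cos(t), math.sin(t)) for t in angles]

order = best_ordering(vectors)
best = path_weight(vectors, order)
print(order, best)
```

In this example the first and third vectors are orthogonal, so any ordering that places them next to each other drives the product to (numerically) zero. The exhaustive search costs $n!$ evaluations, which is why the text asks about heuristics.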



Nonlinear extensions. Extending even the random results in this paper to nonlinear algorithms
such as the general incremental gradient descent algorithm or randomized coordinate descent would
require modifying the analyses used here. However, it would be of interest to see which of the
randomization tools employed in this work can be extended to the nonlinear case. For example, if
we assume that the cost function (2.1) has summands which are sampled i.i.d., can we use similar
tools (e.g., Jensen's inequality, moment bounds) to show that without-replacement sampling works
even in the nonlinear case?

A Additional calculations for random matrices

For the special case of Wishart matrices, we can show that the gap between the norm of the
arithmetic and geometric means in (3.2) is quite large.

Lemma A.1 For $i = 1, \ldots, k$, we have

$$\mathbb{E}_g[G(x, i, \ldots, i)] \ge \|x\|^2\, r\, 2^{-2k} \frac{(4k)!}{(2k)!} .$$
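Proof. Writing $(A_i)_{uu}$ for the $u$-th diagonal entry of $A_i$ and $(Z_i)_{ul}$ for the entries of $Z_i$,

$$\mathbb{E}_g[G(x, i, \ldots, i)] \ge \sum_{u=1}^{d} x_u^2\, \mathbb{E}\left[\left((A_i)_{uu}\right)^{2k}\right] = \|x\|^2\, \mathbb{E}\left[\left((A_i)_{uu}\right)^{2k}\right] .$$

The first inequality is because all the terms are positive and we are selecting out only the self loops.
The equality just groups terms. The following lower bound completes the proof.

$$\mathbb{E}\left[\left((A_i)_{uu}\right)^{2k}\right] = \mathbb{E}\left[\left(\sum_{l=1}^{r} (Z_i)_{ul}^2\right)^{2k}\right] = \mathbb{E}\left[\sum_{l_1 + \cdots + l_r = 2k} \binom{2k}{l_1, \ldots, l_r} \prod_{j=1}^{r} (Z_i)_{uj}^{2 l_j}\right] \ge \sum_{l=1}^{r} \mathbb{E}\left[(Z_i)_{ul}^{4k}\right] = r\, 2^{-2k} \frac{(4k)!}{(2k)!} .$$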



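A simple corollary is the following lower bound on the arithmetic mean:

$$\mathbb{E}_{\mathrm{wr}} \mathbb{E}_g[G(x, i_1, \ldots, i_k)] \ge k^{-k}\, \|x\|^2\, r\, 2^{-2k} \frac{(4k)!}{(2k)!} .$$

We examine the following ratio $\rho(k, r, d)$:

$$\rho(k, r, d) = \frac{\mathbb{E}_{\mathrm{wr}} \mathbb{E}_g[G(x, i_1, \ldots, i_k)]}{\mathbb{E}_{\mathrm{wo}} \mathbb{E}_g[G(x, i_1, \ldots, i_k)]} \ge \frac{r\, (4k)!}{(2k)!\, \left(4k\, r(r+d+1)\right)^{k}} .$$

For fixed $r, d$, $\rho$ grows exponentially with $k$.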

Lemma A.2 For $k, r, d > 0$,

$$\rho(k, r, d) \ge r\, e^{\frac{1}{4k(k+1)}} \left( \frac{16k}{e^2\, r(r+d+1)} \right)^{k} .$$

Proof. We use a very crude lower and upper bound pair that holds for all $k$ (Cormen et al., 2009,
p. 55).

f^e^TKkK V2nk { -] * e l ' 2k 



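With this inequality, the leading terms in the exponent combine as

$$k \ln\left(\frac{4k}{e}\right)^{4} - k \ln\left(\frac{2k}{e}\right)^{2} - k \ln\left(4k\, r(r+d+1)\right) = k \ln\left(\frac{16k}{e^2\, r(r+d+1)}\right) ,$$

so that we can write:

$$\rho(k, r, d) \ge r \exp\left\{ k \ln\left(\frac{16k}{e^2\, r(r+d+1)}\right) + \frac{1}{4k(k+1)} \right\} = r \left( \frac{16k}{e^2\, r(r+d+1)} \right)^{k} e^{\frac{1}{4k(k+1)}} .$$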
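The bound of Lemma A.2 can be compared numerically with the factorial expression it is derived from. The sketch below is ours and makes one assumption explicit: it reads the pre-Stirling lower bound on the ratio as $r\,(4k)!/\big((2k)!\,(4k\,r(r+d+1))^k\big)$, which is our reconstruction of the expression in the text. The Gaussian moment identity $\mathbb{E}[Z^{4k}] = (4k)!/(2^{2k}(2k)!)$ used in Lemma A.1 is standard.

```python
import math

def gaussian_moment(k):
    """E[Z^(4k)] for a standard normal Z, equal to (4k)! / (2^(2k) (2k)!)."""
    return math.factorial(4 * k) // (2 ** (2 * k) * math.factorial(2 * k))

def ratio_before_stirling(k, r, d):
    """r (4k)! / ((2k)! (4 k r (r + d + 1))^k): assumed factorial form of the bound."""
    numerator = r * math.factorial(4 * k)
    denominator = math.factorial(2 * k) * (4 * k * r * (r + d + 1)) ** k
    return numerator / denominator

def lemma_a2_bound(k, r, d):
    """r e^{1/(4k(k+1))} (16k / (e^2 r (r + d + 1)))^k, as stated in Lemma A.2."""
    correction = math.exp(1.0 / (4 * k * (k + 1)))
    base = 16 * k / (math.e ** 2 * r * (r + d + 1))
    return r * correction * base ** k

# Hypothetical parameters chosen only for illustration.
r, d = 2, 3
for k in range(1, 6):
    print(k, ratio_before_stirling(k, r, d), lemma_a2_bound(k, r, d))
```

For these parameters the factorial expression stays above the closed-form bound, and the ratio of the two approaches $\sqrt{2}$ as $k$ grows, as Stirling's formula predicts.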
B Proof that harmonic frames satisfy the noncommutative arithmetic-geometric mean inequality

Problem. Let $S$ be the symmetrized geometric mean of a set of rank-one, idempotent matrices that
are parametrized by angles $\phi_1, \ldots, \phi_n$ (this is slightly more general than we need for our theorem
above). Our goal is to compute the 2-norm of $S$:

$$\max_{v : \|v\| = 1} v^T S v = \max_{\phi_v \in [0, 2\pi]} \frac{1}{n!} \sum_{\sigma \in S_n} \cos(\phi_v - \phi_{\sigma(1)}) \times \prod_{i=1}^{n-1} \cos(\phi_{\sigma(i)} - \phi_{\sigma(i+1)}) \times \cos(\phi_{\sigma(n)} - \phi_v)$$

B.1 Cosine combinatorics

We will write this function as a Fourier transform (we pull out the $2^n$ for convenience):

$$2^n \max_{v : \|v\| = 1} v^T S v = \sum_{k} c_k \cos(k \pi n^{-1}) + \max_{\phi_v} \sum_{l} d_l \cos(\phi_v + l \pi n^{-1}) \qquad \text{(B.1)}$$

To find the $c_k$ and $d_l$, we repeatedly apply the following identity:

$$\cos x \cos y = 2^{-1}\left(\cos(x + y) + \cos(x - y)\right)$$

Fix $\psi_1, \psi_2, \ldots, \psi_n, \ldots \in [0, 2\pi]$. We first consider a related form, $T_n$, for $n = 1, 2, 3, \ldots$,
defined by the following recurrence

$$T_1 = 1 \quad\text{and}\quad T_{n+1} = T_n \cos(\psi_n - \psi_{n+1})$$

We compute $T_n$ using the above transformation. But, first, we show the pattern by example:

Example B.2 (the normalizing power of 2 in each $T_n$ is omitted to keep the pattern visible)

$$\begin{aligned}
T_1 &= 1 \\
T_2 &= \cos(\psi_1 - \psi_2) \\
T_3 &= \cos(\psi_1 - \psi_3) + \cos(\psi_1 - 2\psi_2 + \psi_3) \\
T_4 &= \cos(\psi_1 - \psi_4) + \cos(\psi_1 - 2\psi_2 + 2\psi_3 - \psi_4) + \cos(\psi_1 - 2\psi_3 + \psi_4) + \cos(\psi_1 - 2\psi_2 + \psi_4) \\
T_5 &= \cos(\psi_1 - \psi_5) + \cos(\psi_1 - 2\psi_2 + 2\psi_3 - \psi_5) + \cos(\psi_1 - 2\psi_3 + \psi_5) + \cos(\psi_1 - 2\psi_2 + \psi_5) \\
    &\quad + \cos(\psi_1 - 2\psi_4 + \psi_5) + \cos(\psi_1 - 2\psi_2 + 2\psi_3 - 2\psi_4 + \psi_5) + \cos(\psi_1 - 2\psi_3 + 2\psi_4 - \psi_5) \\
    &\quad + \cos(\psi_1 - 2\psi_2 + 2\psi_4 - \psi_5)
\end{aligned}$$

In our computation above, $\psi_1 = \phi_v = \psi_n$. And so, after writing this out, we will get two kinds of
terms: even terms (corresponding to $c_k$) that do not depend on $\phi_v$ (they cancel) and odd terms that
do contain $2\phi_v$.

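We encapsulate this example in a lemma:

Lemma B.1 With $T_n$ as defined above, we have for $n \ge 2$

$$T_n = 2^{-(n-2)} \sum_{k} \sum_{1 < i_1 < i_2 < \cdots < i_k < n} \cos\left(\psi_1 - 2\psi_{i_1} + 2\psi_{i_2} - \cdots + (-1)^{k} 2\psi_{i_k} + (-1)^{k+1} \psi_n\right)$$

(each application of the product-to-sum identity contributes a factor $2^{-1}$, and $T_n$ is a product of $n-1$ cosines).

Proof. The base case $n = 2$ is immediate. By induction, we have:

$$\begin{aligned}
T_{n+1} &= 2^{-(n-2)} \sum_{k} \sum_{1 < i_1 < \cdots < i_k < n} \cos\left(\psi_1 - 2\psi_{i_1} + 2\psi_{i_2} - \cdots + (-1)^{k} 2\psi_{i_k} + (-1)^{k+1} \psi_n\right) \cos(\psi_n - \psi_{n+1}) \\
&= 2^{-(n-1)} \sum_{k} \sum_{1 < i_1 < \cdots < i_k < n} \Big[ \cos\left(\psi_1 - 2\psi_{i_1} + 2\psi_{i_2} - \cdots + (-1)^{k} 2\psi_{i_k} + (-1)^{k+1} \psi_{n+1}\right) \\
&\qquad\qquad + \cos\left(\psi_1 - 2\psi_{i_1} + 2\psi_{i_2} - \cdots + (-1)^{k} 2\psi_{i_k} + 2(-1)^{k+1} \psi_n + (-1)^{k+2} \psi_{n+1}\right) \Big] \\
&= 2^{-(n-1)} \sum_{k} \sum_{1 < i_1 < \cdots < i_k < n+1} \cos\left(\psi_1 - 2\psi_{i_1} + 2\psi_{i_2} - \cdots + (-1)^{k} 2\psi_{i_k} + (-1)^{k+1} \psi_{n+1}\right)
\end{aligned}$$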
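The expansion in Lemma B.1 can be checked numerically by comparing the product that defines $T_n$ with the sum over index subsets. The sketch below is ours. It normalizes the sum by $2^{-(n-2)}$, which is the factor obtained by applying the identity $\cos x \cos y = 2^{-1}(\cos(x+y) + \cos(x-y))$ a total of $n-2$ times; the angles are arbitrary test values.

```python
import itertools
import math

def product_form(psi):
    """T_n as the product of cos(psi_i - psi_{i+1}) for i = 1, ..., n-1."""
    result = 1.0
    for left, right in zip(psi, psi[1:]):
        result *= math.cos(left - right)
    return result

def sum_form(psi):
    """Normalized sum over subsets 1 < i_1 < ... < i_k < n of alternating cosines."""
    n = len(psi)
    interior = range(1, n - 1)  # 0-based positions of psi_2, ..., psi_{n-1}
    total = 0.0
    for k in range(len(interior) + 1):
        for subset in itertools.combinations(interior, k):
            angle = psi[0]
            sign = -1.0
            for idx in subset:
                angle += sign * 2.0 * psi[idx]
                sign = -sign
            angle += sign * psi[-1]  # at this point sign equals (-1)^(k+1)
            total += math.cos(angle)
    return total / 2 ** (n - 2)

psi = [0.3, 1.1, -0.7, 2.0, 0.9]  # arbitrary angles, n = 5
print(product_form(psi), sum_form(psi))
```

The two printed values agree to floating-point accuracy, and the check can be repeated for any list of at least two angles.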



Fix an $n$. We now count a symmetrized version of $T_n$ defined as follows:

$$S_n = \frac{1}{n!} \sum_{\sigma \in S_n} \prod_{i=1}^{n-1} \cos\left(\phi_{\sigma(i)} - \phi_{\sigma(i+1)}\right)$$

We now show that $S_n$ can be written in a form that removes the permutation. We also assume
some structure here that mimics our product above, namely that $\phi_1 = \phi_n$.

Lemma B.2 Let $\phi_1, \ldots, \phi_n \in [0, 2\pi]$ such that $\phi_1 = \phi_n$. Then,



X,YC[n]:\X\=\Y\ or |X|=|y|+l V 1 1 7 \ \ieX jeY J J 

Proof. To see this formula, consider a pair of sets $X, Y \subseteq [n]$. In how many permutations $\sigma \in S_n$
does $(X, Y)$ contribute a term? We need to choose $|X| + |Y|$ positions for these terms to appear
out of $n$ possible places in the order. Thus there are $\binom{n}{|X| + |Y|}$ ways to choose the slots for
$(X, Y)$. An $(|X|, |Y|)$ pair only appears in a permutation $\sigma$ if the elements of $X$ and $Y$ can be
alternated starting with $X$. This implies that $|X| = |Y| + z_1$ where $z_1 \in \{0, 1\}$. Moreover, there
are $|X|!\,|Y|!\,(n - (|X| + |Y|))!$ permutations that respect this structure (for any choice of $|X| + |Y|$
slots, any ordering of $X$ and $Y$ and the elements outside can occur).

$$\binom{n}{|X| + |Y|}\, |X|!\, |Y|!\, \left(n - (|X| + |Y|)\right)! = n! \binom{|X| + |Y|}{|X|}^{-1}$$

Pushing the $1/n!$ factor inside completes the proof. ■
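The counting identity at the end of this proof is purely arithmetic and can be verified exhaustively for small $n$. The sketch below is ours; the function names are made up for the example.

```python
from math import comb, factorial

def permutations_respecting(n, size_x, size_y):
    """Choices of slots, times orderings of X, of Y, and of the remaining elements."""
    slots = comb(n, size_x + size_y)
    rest = n - size_x - size_y
    return slots * factorial(size_x) * factorial(size_y) * factorial(rest)

def identity_holds(n, size_x, size_y):
    """Check C(n, |X|+|Y|) |X|! |Y|! (n-|X|-|Y|)! = n! / C(|X|+|Y|, |X|)."""
    lhs = permutations_respecting(n, size_x, size_y)
    return lhs * comb(size_x + size_y, size_x) == factorial(n)

# Only the sizes allowed by the alternation constraint: |X| = |Y| or |X| = |Y| + 1.
checked = [
    (n, size_y + extra, size_y)
    for n in range(1, 9)
    for size_y in range(0, n + 1)
    for extra in (0, 1)
    if 2 * size_y + extra <= n
]
print(all(identity_holds(*case) for case in checked))
```

The check multiplies through by the binomial coefficient so that everything stays in exact integer arithmetic.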
B.3 Counting on harmonic, finite groups

In the case we care about, the $\phi_i$ have more structure: the set $\{2\phi_j\}_{j=1}^{n}$ forms a cyclic group under
addition modulo $2\pi$. Let $n$ denote the number of elements in the frame. Fix $n$. Let $\zeta$ denote an $n$th
root of unity. Define a (harmonic) generating function $f$:

$$f(\zeta, x, y) = \prod_{l=1}^{n} \left(1 + \zeta^{l} x + \zeta^{-l} y\right)$$

We give a shorthand for its coefficients $q_{k,m}$ and $r_{k,m}$ as follows:

$$q_{k,m} := [\zeta^{k} x^{m} y^{m}] f \quad\text{and}\quad r_{k,m} := [\zeta^{k} x^{m+1} y^{m}] f$$

We observe that $q_{k,m}$ computes the number of pairs of sets $(X, Y)$, where $X, Y \subseteq \mathbb{Z}_n$, such that:

1. $\sum_{i \in X} i - \sum_{j \in Y} j = k \bmod n$ (since we inspect $\zeta^{k}$),

2. $|X| = |Y| = m$ (since we inspect $x^{m} y^{m}$).

For $r_{k,m}$ the only change is that $|X| = |Y| + 1$ (since we inspect $x^{m+1} y^{m}$). With this notation, we can express
the coefficients from Eq. B.1:

$$c_k = \sum_{m} \binom{2m}{m}^{-1} q_{k,m} \quad\text{and}\quad d_k = \sum_{m} \binom{2m+1}{m}^{-1} r_{k,m}$$

We use this representation to prove that the symmetrized geometric mean is rotationally invariant
(i.e., the terms in (B.1) that depend on $\phi_v$ vanish). First, we show that all $d_k$ are equal.

Lemma B.3 Consider a frame of size $n$. For any $m$ and $k, l = 0, \ldots, n-1$, we have $r_{k,m} = r_{l,m}$, and consequently
$\sum_{k=0}^{n-1} d_k \cos(\phi_v + 2\pi k n^{-1}) = 0$.

Proof. This follows by examining the generating function above. Shifting the index $l$ by $j$ only
permutes the factors of $f$, so we have the congruence $f(\zeta, \zeta^{j} x, \zeta^{-j} y) = f(\zeta, x, y)$ for $j = 0, \ldots, n-1$.
Extracting the coefficient of $x^{m+1} y^{m}$ on both sides, the left-hand side picks up a factor $\zeta^{j(m+1) - jm} = \zeta^{j}$,
which tells us that $[\zeta^{k+j} x^{m+1} y^{m}] f = [\zeta^{k} x^{m+1} y^{m}] f$, with exponents of $\zeta$ taken modulo $n$.
Combining these facts, we have that $r_{k,m} = r_{l,m}$.
Since this holds for all $k, l$, we can conclude that $d_k = d_l$ by summing over $m$. Finally, since
$\sum_{k=0}^{n-1} \cos(\phi_v + 2\pi k n^{-1}) = 0$ for any fixed $\phi_v$ we conclude the lemma. ■
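The independence of $r_{k,m}$ from $k$ can be confirmed by direct enumeration for small $n$. The sketch below is ours and assumes the combinatorial reading given above: each factor of the generating function contributes to $X$, to $Y$, or to neither, so $X$ and $Y$ are disjoint subsets of $\mathbb{Z}_n$. It tabulates the pairs by the residue of $\sum_{i \in X} i - \sum_{j \in Y} j$ modulo $n$.

```python
from itertools import combinations

def residue_counts(n, size_x, size_y):
    """Count disjoint pairs (X, Y) of subsets of Z_n by (sum X - sum Y) mod n."""
    counts = [0] * n
    elements = range(n)
    for xs in combinations(elements, size_x):
        rest = [e for e in elements if e not in xs]
        for ys in combinations(rest, size_y):
            counts[(sum(xs) - sum(ys)) % n] += 1
    return counts

# |X| = |Y| + 1: the counts r_{k,m} do not depend on the residue k.
print(residue_counts(5, 2, 1))
print(residue_counts(6, 3, 2))

# |X| = |Y|: the counts q_{k,m} do depend on k.
print(residue_counts(4, 1, 1))
```

Shifting every element of $X$ and $Y$ by one changes the residue by $|X| - |Y|$, which is why the counts are uniform exactly when the sizes differ by one, and why the balanced case needs the separate treatment of the next subsection.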

Since the symmetrized geometric mean does not depend on $\phi_v$, we conclude it must be of the
form $\alpha I$ for some $\alpha$. The remainder of this note is to compute that $\alpha$.

B.4 Computing the coefficients 



The argument of this subsection is a generalization of that of Konvalina (1995). 
Lemma B.4 

qk,m = (-l) kl 



k J n — k 
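Proof. Define $R_n$ as:

$$R_n(x, y) = x^{n} + (-y)^{n} - \sum_{k} \binom{n-k}{k} \frac{n}{n-k} (xy)^{k}$$

Since $f(\zeta, x, y) = f(\zeta^{l}, x, y)$ for any integer $l$, $F_n(x, y) = f(\zeta, x, y)$ is a function of $n$ alone.
That is, we can write

$$F_n(x, y) = \prod_{\zeta \in U_n} \left(1 + \zeta x + \zeta^{-1} y\right)$$

Thus, the claim boils down to $F_n(x, -y) = R_n(x, y)$.

We show that the zero sets of $F_n(x, -y)$ and $R_n$ are equal. The zero set of $F_n(x, -y)$ is the set
of lines described by

$$\{(x, y) \mid y = \zeta + \zeta^{2} x\} \quad\text{for } \zeta \in U_n$$

where $\zeta$ is any $n$-th root of unity. Substituting $y$ from the line equation, we get that $xy = x\zeta + \zeta^{2} x^{2}$.
Now, we check that the following is zero:

$$x^{n} + (-y)^{n} - \sum_{k} \binom{n-k}{k} \frac{n}{n-k} \left(x\zeta + \zeta^{2} x^{2}\right)^{k}$$

Here, we use the generating function:

$$\sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n-k}{k} \frac{n}{n-k} z^{k} = \left(\frac{1 - \sqrt{1 + 4z}}{2}\right)^{n} + \left(\frac{1 + \sqrt{1 + 4z}}{2}\right)^{n}$$

Using this sum, we have:

$$\begin{aligned}
\sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n-k}{k} \frac{n}{n-k} \left(x\zeta + \zeta^{2} x^{2}\right)^{k}
&= \left(\frac{1 - (2\zeta x + 1)}{2}\right)^{n} + \left(\frac{1 + (2\zeta x + 1)}{2}\right)^{n} \\
&= (\zeta x)^{n} + (1 + \zeta x)^{n} \\
&= x^{n} + (-y)^{n}
\end{aligned}$$

The first equality follows from $1 + 4\zeta x + 4\zeta^{2} x^{2} = (2x\zeta + 1)^{2}$. The second is just algebra. Finally,
we use on each term that $\zeta^{n} = 1$ and that $\zeta + \zeta^{2} x = -y$. This claim holds for all $\zeta$ that are roots
of unity, and so the function is identically zero.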
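The generating function used in this step is the classical closed form for sums with coefficients $\frac{n}{n-k}\binom{n-k}{k}$, and it can be checked numerically. The sketch below is ours; at $z = 1$ the sum reduces to the Lucas numbers, which gives an exact integer check, and a second check uses an arbitrary positive real $z$.

```python
from math import comb, sqrt

def coefficient(n, k):
    """n/(n-k) * C(n-k, k), an integer for n >= 1 and 0 <= k <= n/2."""
    return n * comb(n - k, k) // (n - k)

def series(n, z):
    """Finite sum of coefficient(n, k) * z^k over k = 0, ..., floor(n/2)."""
    return sum(coefficient(n, k) * z ** k for k in range(n // 2 + 1))

def closed_form(n, z):
    """((1 + s)/2)^n + ((1 - s)/2)^n with s = sqrt(1 + 4z), for z >= -1/4."""
    s = sqrt(1 + 4 * z)
    return ((1 + s) / 2) ** n + ((1 - s) / 2) ** n

print([series(n, 1) for n in range(1, 8)])   # Lucas numbers 1, 3, 4, 7, 11, 18, 29
print(series(6, 0.3), closed_form(6, 0.3))
```

In the proof the identity is applied with $z = xy$ evaluated on one of the lines, where $1 + 4z$ is a perfect square.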

To conclude the proof, observe that the zero set described above is the union of $n$ lines of the
form $(1 + \zeta x + \zeta^{-1} y)$. These lines are unique in $\mathbb{C}$: if $(1 + \zeta x + \zeta^{-1} y) = (1 + \omega x + \omega^{-1} y)$ then, since
the $x$ coefficients are the same, $\zeta = \omega$ and so they must be the same. By direct inspection, this $R_n$
can only have these factors (else the total degree would be higher). Hence, $R_n = Q_n$. ■

B.5 Finally, to a hypergeometric series

It is possible to get an explicit formula for this constant that is related to ${}_3F_2$. We consider the following
and show that it is hypergeometric in $k$:

$$v(k) = (-1)^{k} \binom{2k}{k}^{-1} \frac{n}{n-k} \binom{n-k}{k}$$

Consider the ratio:

$$\begin{aligned}
\frac{v(k+1)}{v(k)}
&= - \frac{(n-k-1)!\, ((k+1)!)^{2}}{(k+1)!\, (n-2k-2)!\, (n-k-1)\, (2k+2)!} \cdot \frac{(2k)!\, (n-2k)!\, k!\, (n-k)}{(n-k)!\, (k!)^{2}} \\
&= - \frac{(n-2k)(n-2k-1)(k+1)}{(n-k-1)(2k+2)(2k+1)} \\
&= \frac{(k - n/2)(k - n/2 + 1/2)(k+1)}{(k - n + 1)(k + 1/2)(k+1)}
\end{aligned}$$

And so, this is a hypergeometric series:

$${}_3F_2\left[\begin{matrix} 1, & -n/2 + 1/2, & -n/2 \\ 1/2, & -n + 1 \end{matrix} ; 1\right] = O(n^{-1})$$

This completes the proof.
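The term ratio that identifies the series as hypergeometric can be verified in exact rational arithmetic. The sketch below is ours and assumes the summand $v(k) = (-1)^k \binom{2k}{k}^{-1} \frac{n}{n-k} \binom{n-k}{k}$, which is our reading of the expression in the text; the value of $n$ is arbitrary.

```python
from fractions import Fraction
from math import comb

def v(n, k):
    """Assumed summand: (-1)^k * n/(n-k) * C(n-k, k) / C(2k, k)."""
    return Fraction((-1) ** k * n * comb(n - k, k), (n - k) * comb(2 * k, k))

def predicted_ratio(n, k):
    """(k - n/2)(k - n/2 + 1/2)(k + 1) / ((k - n + 1)(k + 1/2)(k + 1))."""
    half = Fraction(1, 2)
    numerator = (k - Fraction(n, 2)) * (k - Fraction(n, 2) + half) * (k + 1)
    denominator = (k - n + 1) * (k + half) * (k + 1)
    return numerator / denominator

n = 11  # arbitrary frame size for the check
for k in range(0, 5):
    print(k, v(n, k + 1) / v(n, k), predicted_ratio(n, k))
```

The upper parameters $-n/2$, $-n/2 + 1/2$, $1$ and lower parameters $-n + 1$, $1/2$ of the ${}_3F_2$ can be read off from the linear factors of this ratio, with argument $1$.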